feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) by chapmanhk · Pull Request #239 · datakind/edvise-api

chapmanhk · 2026-05-18T21:42:49Z

feat(data): trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy)

Description

After a successful file validation (POST .../input/validate-upload/{file_name} or SFTP validate path), the API starts the Databricks job edvise_validated_gcs_to_bronze_sync to copy the object from GCS validated/ into the institution's bronze volume (gcs_uploads). Validation and batch creation are unchanged; Databricks trigger failures are logged and do not fail the validation response.

The API waits only for Databricks jobs.run_now to accept the run and return a run id. It does not wait for cluster startup or the file copy to finish.
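A minimal sketch of the submit-only behavior described above, not the code in src/webapp/databricks.py. It assumes a jobs client whose run_now accepts job_id and job_parameters and returns an object exposing run_id; the FakeJobs stub, the submit_bronze_sync name, and the JSON-list encoding of include_blob_paths_json are illustrative assumptions.

```python
import json
from dataclasses import dataclass
from typing import Any, Protocol


class JobsApi(Protocol):
    """Assumed shape of the jobs client; only run_now is needed here."""

    def run_now(self, job_id: int, job_parameters: dict[str, str]) -> Any: ...


@dataclass
class SubmittedRun:
    """Hypothetical stand-in for the object returned when a run is accepted."""

    run_id: int


def submit_bronze_sync(jobs: JobsApi, job_id: int, file_name: str) -> int:
    """Submit the sync job and return its run id without waiting for the copy."""
    # Assumption: the job takes a JSON-encoded list of blob paths.
    params = {"include_blob_paths_json": json.dumps([f"validated/{file_name}"])}
    submitted = jobs.run_now(job_id=job_id, job_parameters=params)
    # No polling or waiting here: cluster startup and the file copy
    # continue in Databricks after this function returns.
    return int(submitted.run_id)


class FakeJobs:
    """In-memory stub so the sketch runs without a Databricks workspace."""

    def __init__(self) -> None:
        self.calls: list[tuple[int, dict[str, str]]] = []

    def run_now(self, job_id: int, job_parameters: dict[str, str]) -> SubmittedRun:
        self.calls.append((job_id, job_parameters))
        return SubmittedRun(run_id=123)
```

With the stub, submit_bronze_sync(FakeJobs(), 42, "students.csv") returns as soon as the run is accepted, which is the only step the validation request waits for.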

Behavior

Runs only for institutions with edvise_id or legacy_id (PDP-only institutions are skipped).
Uses existing Databricks auth (DATABRICKS_HOST_URL, GCP service account).
Resolves the job by optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID, exact job name, DEV/STAGING job ID mapping, or a unique bundle-prefixed job name.
Structured JSON trace logs: validation_request, gcs_bronze_sync_background_start, gcs_bronze_sync_background_done with outcome (success | trigger_failed | skipped) and correlation_id for cross-log lookup.
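The job resolution order listed above could be sketched roughly as follows. This is an illustration under assumptions, not the implementation in this PR: the resolve_job_id signature, the (name, id) job listing, the environment mapping, and the "name ends with the job name" rule for bundle-prefixed jobs are all hypothetical.

```python
from typing import Mapping, Optional, Sequence, Tuple

JOB_NAME = "edvise_validated_gcs_to_bronze_sync"
JOB_ID_ENV = "DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID"


def resolve_job_id(
    env: Mapping[str, str],
    jobs: Sequence[Tuple[str, int]],
    env_job_ids: Mapping[str, int],
    environment: Optional[str],
) -> int:
    """Resolve the sync job id: override, exact name, env mapping, prefixed name."""
    # 1. Explicit override from the optional environment variable.
    override = env.get(JOB_ID_ENV, "").strip()
    if override:
        return int(override)

    # 2. Exact job name, rejecting duplicates.
    exact = [job_id for name, job_id in jobs if name == JOB_NAME]
    if len(exact) == 1:
        return exact[0]
    if len(exact) > 1:
        raise ValueError(f"multiple jobs named {JOB_NAME!r}")

    # 3. Known deployed job id for this environment (for example DEV or STAGING).
    if environment is not None and environment in env_job_ids:
        return env_job_ids[environment]

    # 4. Bundle-prefixed name, accepted only when exactly one job matches.
    prefixed = [job_id for name, job_id in jobs if name.endswith(" " + JOB_NAME)]
    if len(prefixed) == 1:
        return prefixed[0]
    raise ValueError(f"could not resolve a unique job for {JOB_NAME!r}")
```

Raising ValueError on ambiguity keeps the trigger from guessing between two deployed copies of the job; the caller can then log the failure without failing validation.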

New / updated

src/webapp/databricks.py — run_validated_gcs_to_bronze_sync, job resolution, bundle-aligned job parameters
src/webapp/routers/data.py — validation-time Databricks trigger in validation_helper
src/webapp/databricks_test.py, src/webapp/routers/data_test.py
src/webapp/.env.example — documents optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID

Kill switch: ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false (default: enabled).
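A sketch of how the kill switch and the PDP-only skip might combine into a single eligibility check. The function names, the skip-reason strings, and the set of values treated as "off" are assumptions; the PR itself only documents ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false and a default of enabled.

```python
from typing import Mapping, Optional

KILL_SWITCH_ENV = "ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION"
# Assumption: only "false" is documented; the other spellings are illustrative.
_OFF_VALUES = {"false", "0", "no", "off"}


def sync_enabled(env: Mapping[str, str]) -> bool:
    """Return True unless the kill switch is explicitly set to an off value."""
    raw = env.get(KILL_SWITCH_ENV, "true")
    return raw.strip().lower() not in _OFF_VALUES


def skip_reason(
    edvise_id: Optional[str],
    legacy_id: Optional[str],
    env: Mapping[str, str],
) -> Optional[str]:
    """Return a hypothetical skip reason, or None when the sync should run."""
    if not sync_enabled(env):
        return "disabled_by_env"
    if not (edvise_id or legacy_id):
        # PDP-only institutions have neither id set and are skipped.
        return "not_edvise_or_legacy"
    return None
```

Checking the environment flag first means a hot disable takes effect for every institution type without touching the database lookup.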

Deployment Readiness*

Testing

Describe or check:

Created or updated unit, feature, and/or integration tests
Typical manual testing in the local env browser, dev pipeline, etc.

Automated: databricks_test.py (job ID resolution, DEV/STAGING mapping, bundle-prefixed job name resolution, run_now params contract); data_test.py (Edvise/Legacy trigger paths, PDP-only skip, env disabled, non-fatal Databricks trigger failure).

Manual (dev): Deployed the feature branch to dev and validated an upload as a Legacy institution. Confirmed that validation succeeded, the API selected the DEV Databricks job id, and the bronze sync job was triggered successfully. Verified that the expected logs include outcome:"success" and databricks_job_run_id; the corresponding run is visible in Databricks Workflows.

Deployment Notes

Describe or check:

No special deployment steps required
Special deployment steps required

Rollback Plan

Describe or check:

Standard revert is sufficient (git revert)

Revert the merge commit. Optionally set ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION=false immediately if a hot disable is needed before revert ships. Validation and existing Databricks flows are unaffected.

Reviewer Guidance / Questions*

Job parameters are pinned to the edvise bundle contract (github_validated_bronze_sync.yml); changes there need a matching API update.
This intentionally triggers Databricks during the validation request, but only waits for run_now to submit the job. It does not wait for the copy itself.
Databricks trigger errors are non-fatal to validation and are logged as outcome:"trigger_failed".
Job resolution includes DEV/STAGING deployed job IDs to handle current Databricks bundle naming differences.

Screenshots / Testing Evidence*

Expected success log:

{"event":"validation_request","correlation_id":"...","inst_id":"...","bucket":"...","file_name":"...","validation_source":"MANUAL_UPLOAD"}
{"event":"gcs_bronze_sync_background_start","correlation_id":"...","inst_id":"...","bucket":"...","file_name":"..."}
{"event":"gcs_bronze_sync_background_done","correlation_id":"...","outcome":"success","validated_blob_path":"validated/<file_name>","databricks_job_run_id":123,"databricks_job_name":"edvise_validated_gcs_to_bronze_sync"}

Databricks: corresponding run visible under Workflows → Jobs for [dev dev_cloudrun_sa] edvise_validated_gcs_to_bronze_sync.
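For reference, log lines of the shape shown above could be produced by a helper along these lines. The trace_event name and signature are assumptions for illustration; the field names are taken from the expected log lines.

```python
import json
from typing import Any


def trace_event(event: str, correlation_id: str, **fields: Any) -> str:
    """Build one compact JSON log line carrying the shared correlation_id."""
    record = {"event": event, "correlation_id": correlation_id, **fields}
    # Compact separators match the single-line format of the expected logs.
    return json.dumps(record, separators=(",", ":"))


# The same correlation_id on every event allows cross-log lookup.
line = trace_event(
    "gcs_bronze_sync_background_done",
    "abc",
    outcome="success",
    databricks_job_run_id=123,
)
```

Passing the same correlation_id to validation_request, gcs_bronze_sync_background_start, and gcs_bronze_sync_background_done is what lets one upload be followed across all three events.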

SOC 2 Change Management Checklist

Provide justification if you are submitting a PR with any boxes checked other than the first.

Reminder for Reviewers: By approving this PR you are confirming that you have reviewed the code for correctness, security, and compliance with our engineering and SOC 2 standards. Do not approve PRs where SOC 2 checklist items are checked without documented justification.

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1214584270975391

- Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync with include_blob_paths_json for validated/{file_name}.
- Call after successful validate-upload / validate-sftp when edvise_id or legacy_id is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable.
- Failures to start the job are logged and do not fail validation.
- Extend data tests with DatabricksControl mock and assertions.

Made-with: Cursor

…lution

Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/ writes. Add correlation_id and JSON trace logs (validation_request, background start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by name with duplicate detection when unset. Refine skip reasons for PDP vs Edvise/Legacy.

Co-authored-by: Cursor <cursoragent@cursor.com>

Extract Databricks helpers and job-parameter constants, use specific exceptions (ValueError, DatabricksError), and split background logging into focused functions under 50 lines. Add tests for PDP-only and env kill-switch skips plus run_now parameter contract coverage.

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

vishpillai123

nice error handling! Looks good

vishpillai123 · 2026-05-28T22:04:43Z

@chapmanhk has this been tested on dev webapp yet or still needs to be tested?

chapmanhk · 2026-05-29T15:36:35Z

> @chapmanhk has this been tested on dev webapp yet or still needs to be tested?

It's been tested on the webapp!

* docs: inherit org community health files (#237) * docs: remove local community health files to inherit from org-wide .github repo * docs: update README to include previous contributing info * feat(api): simplify create model request to name only (#238) * chore: bump edvise dependency to 1.0.0 (#241) Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * chore: creating dummy changlog.md file while we create semver / gitflow process * feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) (#239) * Trigger GCS→bronze Databricks sync after validation (Edvise/Legacy) - Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync with include_blob_paths_json for validated/{file_name}. - Call after successful validate-upload / validate-sftp when edvise_id or legacy_id is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable. - Failures to start the job are logged and do not fail validation. - Extend data tests with DatabricksControl mock and assertions. Made-with: Cursor * feat(data): bronze sync after validation with tracing and job id resolution Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/ writes. Add correlation_id and JSON trace logs (validation_request, background start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by name with duplicate detection when unset. Refine skip reasons for PDP vs Edvise/Legacy. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(data): align bronze sync with universal principles Extract Databricks helpers and job-parameter constants, use specific exceptions (ValueError, DatabricksError), and split background logging into focused functions under 50 lines. Add tests for PDP-only and env kill-switch skips plus run_now parameter contract coverage. 
Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply ruff format to bronze sync modules Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): resolve prefixed bronze sync Databricks jobs Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): map bronze sync job ids by environment Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): trigger bronze sync during validation request Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat(api): validate Edvise uploads with repo schemas (#242) * feat(api): validate Edvise uploads with repo schemas Route Edvise student and course uploads through upstream edvise Pandera schemas so upload validation matches the pipeline contract and is not bypassed by registry schema drift. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(api): separate Edvise validation routing Keep PDP and Edvise repo-schema upload validation paths distinct so each helper has a single responsibility while preserving the same validation behavior. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(api): remove redundant repo validation fallback Keep JSON validation flow focused now that PDP and Edvise repo-schema uploads are routed before schema merging. Co-authored-by: Cursor <cursoragent@cursor.com> * style(api): format validation routing test Apply Ruff formatting to keep the Edvise validation routing tests passing style checks. 
Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat(webapp): establish pyproject.toml as canonical Edvise API version (#243) * feat(webapp): establish pyproject.toml as canonical Edvise API version, set version and title in OpenAPI * docs: rename SST -> Edvise * docs(webapp): update formatter instructions to ruff to align with github workflows and engineering playbook * test(webapp): assert OpenAPI version matches pyproject.toml * feat(eda): add clear_cache option to /eda endpoint (#233) * feat: legacy school inference DB job trigger (#212) * feat: custom school inference, but need to confirm if custom is the same as legacy * fix: transitioning from 'custom' to 'legacy' * fix: remove validation of job parameters, handled already through edvise * fix: run request still requires str values, defaulting to empty string * fix: still getting pydantic error * feat: using substring matching to find legacy job since i have it deployed under my name because of target==dev * fix: style * fix: style * fix: style * fix: making batch file name more robust so we don't run into decoding issues * fix: merge conflict * fix: merge conflict --------- Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * feat: add gen ai as 4th option in addition pdp / edvise / legacy in api and uploads (#244) * feat: Added "GenAI" as an option for "create institution" note: for GenAI raw files, we will reuse the same loose rules as Legacy institutions (read CSV, PII check, no strict ES columns). * fix: style --------- Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * fix(models): derive PDP batch schema configs from institution schemas (#247) * fix(models): derive PDP batch schema configs from institution schemas When model.schema_configs is null, PDP inference now builds a default required COURSE+STUDENT (etc.) batch rule from inst.schemas instead of 500ing. 
Explicit model configs still take precedence. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(models): cast jsonpickle decode for mypy no-any-return Co-authored-by: Cursor <cursoragent@cursor.com> * fix(databricks): prefer Cloud Run job when pipeline name is ambiguous When multiple dev bundle jobs match a PDP or legacy inference pipeline substring, resolve to [dev dev_cloudrun_sa] if present; otherwise use the first sorted match instead of failing the inference request. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> * ci: datakind shared workflows (#245) * ci: datakind shared workflows * refactor: rename test.yml -> tests.yml * fix(ci): add workflow_call to style and tests workflows * refactor: use pre-release workflow from shared workflows * ci: replace with shared enforce-pr-targets workflow Aligns checks against the current protected branches, main and develop, rather than staging * chore: remove unused workflow * refactor: remove pull_request triggers. 
These run via ci.yml * ci: pin tests and type-check to Python 3.13 * chore(ci): remove legacy webapp-and-worker precommit workflow * ci: standardize on Python 3.12 across workflows and pyproject * ci: test workflow enforcement * ci: test workflow enforcement * ci: add gate job to report required ci status check * chore: bump python version to 3.10 * chore: standardize Python 3.12 across project and Docker * chore: updating edvise v1.2.0 * chore: CHANGELOG.md update + type check * chore(release): bump version * ci(cloudbuild): parameterize webapp deploy for multi-environment triggers --------- Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com> Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>

* Merge pull request #217 from datakind/fix/pdp-course-handling-duplicates fix: fix duplicate-handling step in validation * fix(storage): reduce peak memory during upload validation - Download unvalidated blob to a temp file and validate by path instead of blob.open().read() via _path_for_edvise_read (avoids a full in-RAM copy). - Write validated CSV to a temp file and upload_from_filename instead of building the entire CSV in a StringIO string. Branched from develop (repo has no dev branch). Made-with: Cursor * chore(storage): log errno on temp download/to_csv OSError Helps distinguish ENOSPC vs other failures in Cloud Run logs; re-raises unchanged. Made-with: Cursor * test(storage): cover temp cleanup and OSError logging for validate upload - Download OSError: unlink temp, skip validate_file_reader, log errno - to_csv OSError: unlink temp, no upload, log errno - Upload failure after to_csv: temp still unlinked Made-with: Cursor * refactor(storage): extract temp download/unlink helpers for clarity Aligns with universal-principles: keep _run_validation_and_get_normalized_df under 50 lines, reduce nesting, replace tmp_path with local_csv_path naming. 
Made-with: Cursor * style: apply black/ruff format to gcsutil_test.py Made-with: Cursor * feat: consolidating staging into main and using main going forward as production (#234) * Feat: Added backfill endpoint * Fix: linting * added func description * added func description * added func description * added func description * added func description * added func description * added func description * feat: adjusted run output endpointto return model_run_id * Delete .DS_Store * Delete src/.DS_Store * Delete terraform/.DS_Store * feat: added model deletion endpoint * feat: added model deletion endpoint * feat: added model deletion endpoint * fix: linting * fix: linting * fix: linting * fix: linting * fix: linting * fix: linting * fix: linting * fixed model name malformation * fix: removed databricks deletion functionality * fix: removed query results not needed * fix: removed query results not needed * fix: added status * fix: added status * fix: formatting fix * fix: added query to retrieve model id * fix: added passive delete to db cascade so deleting the model ensures job runs are deleted * fix: removed extra db query for model id, since db now handles passive deletes * fix: formatting fix * fix: removed db mapping framework * fix: removed db mapping framework * fix: removed db mapping framework * fix: removed db mapping framework * feat: changed endpoint parameter name from experiment_run_id -> model_run_id * fix: type check errors * test batch and file data * eda endpoints * test data * eda calculations * eda year and term, course enrollemnts * eda degree types * fix: divide data category into a seperate front end table section * fix: linting * feat: developed function for adding custom jobs with institution and model validation * fix: linting errors * fix: linting errors * fix: changed route from GET to POST * fix: added output filename definition * fix: linting errors * eda test institution data * eda test institution * eda data * eda test data * allow missing 
eda data * eda enrollment type by intensity * eda pell recipient by 1st gen * eda student age by gender * eda pell status by race * eda tests * cache eda * tidy up * remove LOCAL test bucket setup * return List from get_term_counts * import pandas * remove unused variable * tidy up * eda bucket names * fix: type check errors * fix: type check errors * fix: type check errors * fix: formatting errors * fix: type check errors * fix: type check errors * fix: batch name renewal * fix: batch name renewal * fix: changed output_valid to true * fix: adjusted model card file path * fix: ensuring we are grabbing the most recent run for a model id * remove colors from /eda endpoint * return count and percentage in /eda degree_types * tidy up * fix: fix file format * fix: retrieve by model_run_id instead * fix: formatting * fix: validation error for worwic * fix: changed model name to model_run_id parameter * fix: added function to retrieve config.toml from select catalog * manually initialized course mappings * feat: added validation mapping * fix: formatting * fix: pylint * Ignore .cursor folder for personal cursor preferences * feat(schema): add Edvise schema definition * feat(institutions): add Edvise schema support Add Edvise schema support to institution management: - Add edvise_id field to InstTable and SchemaRegistryTable - Update create/update endpoints with Edvise support and validation - Add mutual exclusivity check (PDP vs Edvise) - Implement normalization for empty strings and whitespace - Remove redundant boolean flags (derive status from ID presence) - Add comprehensive test coverage (34 new test cases) All changes are backward compatible. 
* fix: resolve CI/CD test failures - Fix test_create_inst_with_edvise_success: use unique institution name to avoid UNIQUE constraint violation - Fix test_trigger_inference_run: add pdp_id to InstTable fixture in models_test.py - Fix code formatting: run ruff format on database.py, institutions.py, and institutions_test.py These fixes address the three issues that were causing CI/CD test failures: 1. UNIQUE constraint failed: inst.name in test_create_inst_with_edvise_success 2. Assertion error: expected 400 but got 501 in test_trigger_inference_run 3. Ruff format check failures * fix: resolve unique constraint conflicts in SchemaRegistryTable - Add doc_type to is_pdp and is_edvise unique constraints to allow base, PDP, and Edvise schemas to coexist with same version - Add CheckConstraint to enforce mutual exclusivity of is_pdp and is_edvise flags Fixes Bugbot issue: Unique constraint prevented coexisting schema types for same version. The original constraints (is_pdp, version_label) and (is_edvise, version_label) prevented base schema and PDP/Edvise extensions from sharing the same version label since they all had is_pdp=False and is_edvise=False. Adding doc_type to these constraints allows proper coexistence while maintaining uniqueness guarantees. Also adds database-level enforcement that is_pdp and is_edvise cannot both be True simultaneously. * fix: resolve mypy type errors - Fix type error in institutions.py: change set to list for requested_schemas default value - Add return type annotations to all test functions in institutions_test.py - Add return type annotations to fixture functions - Add typing.Any import for fixture return types Fixes mypy errors: incompatible types in assignment and missing return type annotations. 
* fix: add missing type annotations to test function parameters - Add TestClient type annotations to test_create_inst_unauth, test_create_inst, test_edit_inst, and test_delete_inst Fixes mypy errors: Function is missing a type annotation for one or more arguments. * feat: Implement Phase 3 Edvise schema validation logic - Add EDVISE_SCHEMA_GROUP constant to utilities.py (mirrors PDP_SCHEMA_GROUP) - Add _edvise_cache to _ValidationState class for schema caching with TTL - Update validation_helper() to load Edvise schema when edvise_id is set - Add defensive check for mutual exclusivity (pdp_id and edvise_id cannot both be set) - Add error handling for missing Edvise schema with clear error messages - Update institution creation endpoint to use EDVISE_SCHEMA_GROUP when edvise_id is provided - Add comprehensive test suite: 15 tests covering happy path, errors, cache, authorization, and edge cases This implementation enables institutions with edvise_id to use the Edvise schema extension for file validation, following the same pattern as PDP schema validation. All changes are backwards compatible and include comprehensive test coverage (~90% of critical paths). * fix: Resolve Edvise test failures and improve test reliability - Fix type annotation error in PDP schema branch (mypy no-redef) - Change test user to DATAKINDER for multi-institution access - Fix database constraint violation in precedence test (version_label) - Simplify cache tests to verify behavior instead of implementation - Remove duplicate assertion in cache expiration test - Optimize imports in test fixture * fix: Update Edvise test filenames to include descriptive keywords - Change generic test filenames (test.csv, test_file.csv, etc.) 
to include 'student' keyword - This allows validation_helper to properly infer model types from filenames - Fixes ValueError: Could not infer model(s) from file name errors - Formatting will be applied by CI ruff formatter * style: Format data_test.py with ruff * fix(validation): return proper HTTP status codes for institution errors - Change ValueError to HTTPException (404) when institution not found in validation_helper - Fix test_validate_edvise_unauthorized to test actual unauthorized access instead of non-existent institution - Ensures proper HTTP status codes are returned to API clients * fix: handle filename inference errors and extension schema deactivation - Replace ValueError with HTTPException (422) for filename inference failures to return proper user-facing error instead of 500 - Deactivate existing extension schemas before inserting new ones to ensure only one active extension per institution and prevent nondeterministic queries - Add comprehensive validation error formatter with PII masking and user-friendly messages - Add integration and snapshot tests for error formatter * fix: remove unused imports from validation_error_formatter_snapshot_test - Remove unused typing imports (Any, Dict, List) - Remove unused pandera imports (DataFrameSchema, Column, Check) - Remove unused MAX_ERROR_EXAMPLES import Fixes ruff linting errors (F401) reported in CI. 
* fix: resolve test failures and configuration issues - Remove invalid catalog_name parameter from create_custom_schema_extension call - Restore testpaths configuration to use src directory - Add Pandera FutureWarning filter to pytest config - Fix syntax warning in databricks.py docstring - Format files with Ruff * fix: resolve Ruff and Mypy linting errors - Remove unused imports (IO, cast, tomli/tomllib) from databricks.py - Remove duplicate import re statement - Add type annotations to test cases in validation_error_formatter_test.py - Add type: ignore comments for intentional invalid type tests * fix: align database constraints with production schema and fix Edvise version_label collision - Fix uq_pdp_version constraint to match production: remove doc_type (matches actual DB schema) - Remove uq_edvise_version constraint (enforced operationally, not via DB constraint) - Update CHECK constraint to use MySQL-compatible boolean values (1/0 instead of TRUE/FALSE) - Fix Edvise test fixture to use version_label='edvise-1.0.0' to avoid uq_pdp_version collision - Add explanatory comment about version_label choice in test fixture These changes ensure the ORM matches the actual production database schema and prevent constraint violations when running tests against MySQL. * fix: handle parameterized Pandera check types in validation error formatting Fix bug where parameterized check types (e.g., "isin(['A', 'B', 'C'])", "str_length(3, None)") were not being matched to their formatters, causing generic error messages instead of human-readable ones. 
Changes: - Add _extract_base_check_type() to extract base type from parameterized check types (e.g., "isin(['A', 'B'])" -> "isin") - Add _normalize_check_type_alias() to map verbose Pandera names to spec keys (e.g., "greater_than" -> "gt", "greater_than_or_equal_to" -> "ge") - Update _find_check_spec() to use base type extraction and alias normalization - Update _format_check_error() to only format when matching spec is found (prevents semantic errors like formatting "greater_than" as "ge") - Add _format_gt_error() and _format_lt_error() for strict comparison checks - Preserve semantic correctness: strict comparisons (> and <) vs non-strict (≥ and ≤) Edge cases handled: - Namespaced types: "Check.isin(['A'])" -> "isin" - Empty/None/non-string inputs: returns safe empty string - Spaces around parentheses: "isin (['A'])" -> "isin" - Complex repr: "str_matches(re.compile('...'))" -> "str_matches" Testing: - Add comprehensive unit tests for base type extraction and alias handling - Add tests for parameterized check types (isin, str_length, gt, ge) - Update integration test assertion to match actual output format - Update snapshot fixtures to reflect new human-readable messages Fixes parameterized check type matching while maintaining semantic correctness for strict vs non-strict comparisons. * style: format validation_error_formatter files with ruff Auto-formatted files to comply with project formatting standards. 
* feat: add case-insensitive institution name lookup - Implement case-insensitive matching for GET /institutions/name/{inst_name} endpoint - Use func.lower() on both database column and input parameter for case-insensitive comparison - Update docstring to document case-insensitive behavior and error handling - Add comprehensive test cases for case-insensitive matching: - Test multiple case variations (original, title case, uppercase, mixed case) - Test lowercase input matching database entries - Test uppercase input matching lowercase database entries - Fix type error: change requested_schemas assignment from set to list for type consistency - Apply code formatting with ruff * fix: add missing return type annotations to test functions - Add Generator import from typing for fixture return types - Add return type annotations (-> None) to all test functions: - test_read_all_inst - test_read_all_inst_datakinder - test_read_inst_by_name - test_read_inst_by_name_case_insensitive - test_read_inst_by_name_case_insensitive_lowercase - test_read_inst_by_name_case_insensitive_uppercase - test_read_inst_by_pdp_id - test_read_inst - Fix fixture return types to use Generator[TestClient, None, None] - client_fixture - datakinder_client_fixture - Resolves mypy type checking errors for test file * style: apply ruff formatting to test file - Split long function signatures across multiple lines for readability - Format client_fixture and datakinder_client_fixture function signatures - Format test_read_inst_by_name_case_insensitive_lowercase and _uppercase function signatures * fix(test): update institutions test for edvise_id API changes - Remove unused typing.Any import - Update test_read_all_inst_datakinder to include edvise_id in expected response - Add edvise_test_school institution to expected response (4 institutions total) - Fix line length for pylint compliance This fixes test failures caused by API changes from develop branch that now return edvise_id and pdp_id fields for 
all institutions. * fix(validation): pass institution_id so Edvise/PDP/custom use correct extension block - Thread schema_namespace (edvise | pdp | inst UUID) from data router through validate_file and validate_file_reader into validate_dataset - merge_model_columns now receives correct key for extension_schema['institutions'] - Add institution_id param with default 'pdp' for backward compatibility - Add tests: assert Edvise validation passes institution_id='edvise'; add unit test that institution_id selects the right extension block (edvise vs pdp) - Expand docstrings (Args/Returns) and add comment explaining schema_namespace - Addresses reviewer Q1: schema extension logic now works for Edvise and custom institutions, not only PDP * Apply Black formatting to institutions_test.py * Apply ruff format to institutions_test.py * Fix institutions_test assert for Black and Ruff format compatibility * Fix pylint E1135 in data_test: use .get() instead of membership test on captured_schema * Apply ruff format to data_test.py * feat(validation): schema validation during upload with PDP/edvise repo alignment - Add PDP edvise schema validation path (validation_pdp_edvise) - Add Edvise-to-PDP normalization (validation_edvise_normalize) - Integrate repo schemas into validation pipeline and error formatter - Update pdp_schema_extension and lockfile; add tests Co-authored-by: Cursor <cursoragent@cursor.com> * feat(validation): write normalized data to validated/, archive raw to raw/ - On validation success: archive original to raw/{filename}, write normalized (canonical columns, coerced dtypes) DataFrame to validated/{filename}, delete from unvalidated/ - Validation layer always returns normalized_df on success; storage serializes to UTF-8 CSV and uploads to validated/ - Add input validation and helpers in gcsutil (under 50 lines); catch specific exceptions; TYPE_CHECKING for HardValidationError in validation_pdp_edvise - Add gcsutil_test.py: validate_file input/error/success 
paths, _run_validation_and_get_normalized_df, _write_dataframe_to_gcs_as_csv - Add validation_test: empty-schema short-circuit returns normalized_df None - Ruff/black formatting and lint fixes; mypy-clean for touched files Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(validation): align with universal principles, add tests, fix types and format - Extract validation helpers to meet 50-line rule (_header_missing_and_extra, _get_csv_read_kwargs, _validate_optional_columns_json) - Extract gcsutil._archive_raw_and_write_validated; add type hints to rename_file - Add tests: PDP rename/validate_dataframe, CSV read failure, gcsutil error propagation, edvise institution_identifier in validate_file call - Remove unused validation_edvise_normalize and its tests - Fix mypy in validation_pdp_edvise and tests (Optional[List], cast, annotations) - Apply ruff format Co-authored-by: Cursor <cursoragent@cursor.com> * feat(validation): use edvise read for PDP uploads and add PDP path tests - Route PDP cohort/course through edvise read (read_raw_pdp_*); remove API-side normalizers for PDP so pipeline and API share one source of truth - Add _path_for_edvise_read, _read_pdp_course_edvise, _validate_pdp_with_edvise_read - Convert Pandera SchemaErrors to HardValidationError in PDP path - Add validation_pdp_read_path_test.py (routing, path cleanup, SchemaErrors, course converter fallback); extend Src type with io.StringIO for file-like Co-authored-by: Cursor <cursoragent@cursor.com> * move cloud build config to repo * sst-app-api -> edvise-api * quiet down sqlalchemy * use EdaSummary from edvise * use ruff formatter * test a file * tidy up * Add return type annotations for mypy in main_test and users_test * tidy up * move cache check after batch result check * fix test_execute_pdp_pull * install git * install git in correct Dockerfile * install git in worker * update edvise branch * use develop branch for edvise * install edvise in build * cloudbuild with edvise * 
fix(validation): resolve pylint used-before-assignment error Initialize schema_err_to_raise before try block to satisfy pylint's static analysis, which doesn't recognize that pytest.skip() always raises. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(api): add legacy school type with any-format uploads - Add legacy_id to InstTable and institution API (create, update, read) - Enforce mutual exclusivity of pdp_id, edvise_id, legacy_id via has_at_most_one_school_type - Legacy validation: encoding + CSV read only, no schema checks - Add LEGACY_SCHEMA_GROUP and tests for legacy path and mutual exclusivity Made-with: Cursor * feat(api): legacy PII check, principles compliance, and test coverage - Add PII column check for legacy uploads; reject before raw/validated - Treat student_id as non-PII (false positive) for all institution types - Comply with universal principles: docstrings, extract create_institution helpers (<50 lines), comment lazy import in validation - Add tests: has_at_most_one_school_type, legacy header-only CSV, legacy PII rejection returns 400, explicit legacy_id create, update add legacy_id, storage/Databricks failure paths - Fix mypy in create_institution (row variable) Made-with: Cursor * docs(api): use Edvise Schema (ES) naming to reduce confusion Replace 'Edvise schema' with 'Edvise Schema (ES)' in docstrings, comments, and user-facing error messages so the schema type is distinguished from the Edvise product (ES convention). 
Made-with: Cursor * feat(data): allow legacy institutions to upload files with any filename - Fetch institution before filename inference; set allowed_schemas to UNKNOWN when inference fails for legacy (non-legacy still get 422 for non-descriptive names) - Refactor validation_helper into helpers under 50 lines; add full docstrings, early empty-filename and invalid inst_id validation, log before 404 - Add unit tests for _infer_allowed_schemas_from_filename and _ext_models_set - Add integration tests: empty filename 422, invalid inst_id 404, edvise non-descriptive filename 422, duplicate validate idempotent - Fix mypy and ruff/black in data.py and data_test.py - Add PR_DESCRIPTION.md for feature branch Made-with: Cursor * chore: remove PR_DESCRIPTION.md Made-with: Cursor * fix(validation): run PII check for header-only legacy CSVs * fix(test): align validation error snapshot with non-PII student_id display Made-with: Cursor * feat(validation): use PDP cohort converter and support custom converters - Use converter_func_cohort by default for PDP cohort validation (filters DE/DS/SE) - Add optional pdp_cohort_converter_func and pdp_course_converter_func to validate_file_reader and validate_dataset for school-specific overrides - Course validation tries custom converter first, then default handling_duplicates - Validate converter args are callable; convert converter/read failures to HardValidationError so API returns 400 with context - Add PDPConverterFunc type; extract helpers to meet 50-line and error-handling rules Made-with: Cursor * fix(validation): satisfy mypy for PDP validation and tests - Add unreachable return after with block in _validate_pdp_with_edvise_read - Use cast(Any, ...) 
in tests that pass non-callables to converter params Made-with: Cursor * chore: remove real institution names * chore: ruff format * fix: use latest edvise EdaSummary * fix: use edvise develop branch * chore(deps): pin edvise to develop * feat(ci): notify slack channel on deployment * fix: lock file was out of sync * chore: bump edvise version to 0.1.12 * Revert "feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming" * Revert "Revert "feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming"" * feat(config): add optional local inst/batch/file seed from config for LOCAL * style: ruff format * fix(validation): pass schema_type to handling_duplicates for PDP course CSV read_raw_pdp_course_data calls converter_func(df) with one argument; bare handling_duplicates is invalid on current edvise. Use a wrapper that calls handling_duplicates(df, "pdp") positionally for edvise compatibility. Remove the broken second default converter. Update PDP read path test. Made-with: Cursor * style: ruff format PDP course read path test Made-with: Cursor * fix(deps): upgrade databricks-sql-connector for pyarrow>=17 (edvise) databricks-sql-connector 3.5 pins pyarrow<17; edvise requires pyarrow>=17. Use databricks-sql-connector[pyarrow]~=4.2.x and refresh uv.lock (pyarrow 19). Aligns lock with Cloud Build 'uv lock --upgrade-package edvise'. Made-with: Cursor * Merge pull request #217 from datakind/fix/pdp-course-handling-duplicates fix: fix duplicate-handling step in validation * fix(storage): reduce peak memory during upload validation - Download unvalidated blob to a temp file and validate by path instead of blob.open().read() via _path_for_edvise_read (avoids a full in-RAM copy). - Write validated CSV to a temp file and upload_from_filename instead of building the entire CSV in a StringIO string. Branched from develop (repo has no dev branch). 
Made-with: Cursor * chore(storage): log errno on temp download/to_csv OSError Helps distinguish ENOSPC vs other failures in Cloud Run logs; re-raises unchanged. Made-with: Cursor * test(storage): cover temp cleanup and OSError logging for validate upload - Download OSError: unlink temp, skip validate_file_reader, log errno - to_csv OSError: unlink temp, no upload, log errno - Upload failure after to_csv: temp still unlinked Made-with: Cursor * refactor(storage): extract temp download/unlink helpers for clarity Aligns with universal-principles: keep _run_validation_and_get_normalized_df under 50 lines, reduce nesting, replace tmp_path with local_csv_path naming. Made-with: Cursor * style: apply black/ruff format to gcsutil_test.py Made-with: Cursor * fix(storage): reduce peak memory during upload validation - Download unvalidated blob to a temp file and validate by path instead of blob.open().read() via _path_for_edvise_read (avoids a full in-RAM copy). - Write validated CSV to a temp file and upload_from_filename instead of building the entire CSV in a StringIO string. Branched from develop (repo has no dev branch). Made-with: Cursor * chore(storage): log errno on temp download/to_csv OSError Helps distinguish ENOSPC vs other failures in Cloud Run logs; re-raises unchanged. Made-with: Cursor * test(storage): cover temp cleanup and OSError logging for validate upload - Download OSError: unlink temp, skip validate_file_reader, log errno - to_csv OSError: unlink temp, no upload, log errno - Upload failure after to_csv: temp still unlinked Made-with: Cursor * refactor(storage): extract temp download/unlink helpers for clarity Aligns with universal-principles: keep _run_validation_and_get_normalized_df under 50 lines, reduce nesting, replace tmp_path with local_csv_path naming. 
Made-with: Cursor * style: apply black/ruff format to gcsutil_test.py Made-with: Cursor * chore: bump edvise v0.2.0 * fix(pdp-validation): default cohort converter to none Stop passing edvise converter_func_cohort when pdp_cohort_converter_func is omitted so PDP cohort rows are validated as read. - Callers may still pass an explicit cohort converter. - Update PDP read-path test to expect converter_func=None. - Refresh docstrings (pipeline vs API, Args/Returns/Raises) in validation and validation_pdp_edvise. Made-with: Cursor * feat(api): remove custom institution path; require school type; legacy schemas UNKNOWN - Require exactly one of PDP, Edvise, or Legacy on POST /institutions - Remove custom schema resolution and Databricks extension generation for uploads - Fix PATCH /institutions to persist allowed_schemas to inst.schemas column - LEGACY_SCHEMA_GROUP stores UNKNOWN only; drop validation_extension module - Update tests and default fixtures for typeless/custom removal Made-with: Cursor * feat(api): harden institutions API after custom-institution removal - POST/PATCH: require exactly one school type (pdp, edvise, or legacy) - PATCH: recompute schemas only when the type triple changes; merge optional allowed_schemas on change - PATCH: honor is_edvise/is_legacy for auto-assigned ids (POST parity) - Docs/tests: validation namespaces; disambiguate custom naming in code and tests Made-with: Cursor * docs(api): revert broad custom wording; keep upload docs accurate Restore original docstrings and test names where "custom" referred to converters, schema config, or JSON keys, not custom institutions. Keep gcsutil validate_file institution_id line aligned with pdp/edvise/legacy only (no institution-UUID-for-custom upload path). 
Made-with: Cursor * fix(institutions): reject POST duplicate when existing row lacks school type When (name, state) matches an existing InstTable row, validate stored pdp_id/edvise_id/legacy_id the same as new creates: at most one non-null and exactly one required. Return 400 with guidance instead of 200 for typeless or invalid rows. Add regression tests. Made-with: Cursor * test(institutions): cover duplicate POST, PATCH flags, allowed_schemas-only - Reject is_pdp without pdp_id on POST - Reject duplicate (name, state) when stored row has conflicting ids - Reject PATCH is_edvise on PDP row without clearing pdp_id - Reject PATCH with both is_edvise and is_legacy - allowed_schemas-only PATCH replaces schemas when type unchanged Made-with: Cursor * refactor(institutions): extract PATCH helpers and DRY school-type errors - Add shared mutual-exclusion detail constant for POST/PATCH paths - Extract duplicate-post row validation and PATCH merge/validate/persist helpers - Keep update_inst within single-responsibility helpers; reuse row response mapper Made-with: Cursor * fix(lint): satisfy ruff and mypy on databricks and institutions - Remove unused HTTPException import from databricks.py (F401) - Cast ORM row in _require_single_institution_row_by_uuid for InstTable (no-any-return) Made-with: Cursor * style(institutions): apply ruff format to router and tests Made-with: Cursor * refactor: simplify local_inst_data * docs: Update local_inst_data instructions * chore: remove unused import * fix: make pdp_id and state optional * chore: bumping pyproject and uv.lock --------- Co-authored-by: Mesh <meshach.ogunmodede@datakind.org> Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com> Co-authored-by: William Carr <bill.carr@datakind.org> Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com> Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: William 
Carr <bill@datakind.org> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com> * Revert "feat: consolidating staging into main and using main going forward as…" (#236) This reverts commit 9b70f23. * Merge develop into main (#240) * docs: inherit org community health files (#237) * docs: remove local community health files to inherit from org-wide .github repo * docs: update README to include previous contributing info * feat(api): simplify create model request to name only (#238) * chore: bump edvise dependency to 1.0.0 (#241) Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * chore: creating dummy changlog.md file while we create semver / gitflow process --------- Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com> Co-authored-by: William Carr <bill@datakind.org> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * chore(release): edvise-api 1.0.0 (#249) * docs: inherit org community health files (#237) * docs: remove local community health files to inherit from org-wide .github repo * docs: update README to include previous contributing info * feat(api): simplify create model request to name only (#238) * chore: bump edvise dependency to 1.0.0 (#241) Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * chore: creating dummy changlog.md file while we create semver / gitflow process * feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) (#239) * Trigger GCS→bronze Databricks sync after validation (Edvise/Legacy) - Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync with include_blob_paths_json for validated/{file_name}. - Call after successful validate-upload / validate-sftp when edvise_id or legacy_id is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable. - Failures to start the job are logged and do not fail validation. - Extend data tests with DatabricksControl mock and assertions. 
Made-with: Cursor * feat(data): bronze sync after validation with tracing and job id resolution Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/ writes. Add correlation_id and JSON trace logs (validation_request, background start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by name with duplicate detection when unset. Refine skip reasons for PDP vs Edvise/Legacy. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(data): align bronze sync with universal principles Extract Databricks helpers and job-parameter constants, use specific exceptions (ValueError, DatabricksError), and split background logging into focused functions under 50 lines. Add tests for PDP-only and env kill-switch skips plus run_now parameter contract coverage. Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply ruff format to bronze sync modules Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): resolve prefixed bronze sync Databricks jobs Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): map bronze sync job ids by environment Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): trigger bronze sync during validation request Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat(api): validate Edvise uploads with repo schemas (#242) * feat(api): validate Edvise uploads with repo schemas Route Edvise student and course uploads through upstream edvise Pandera schemas so upload validation matches the pipeline contract and is not bypassed by registry schema drift. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(api): separate Edvise validation routing Keep PDP and Edvise repo-schema upload validation paths distinct so each helper has a single responsibility while preserving the same validation behavior. 
Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(api): remove redundant repo validation fallback Keep JSON validation flow focused now that PDP and Edvise repo-schema uploads are routed before schema merging. Co-authored-by: Cursor <cursoragent@cursor.com> * style(api): format validation routing test Apply Ruff formatting to keep the Edvise validation routing tests passing style checks. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat(webapp): establish pyproject.toml as canonical Edvise API version (#243) * feat(webapp): establish pyproject.toml as canonical Edvise API version, set version and title in OpenAPI * docs: rename SST -> Edvise * docs(webapp): update formatter instructions to ruff to align with github workflows and engineering playbook * test(webapp): assert OpenAPI version matches pyproject.toml * feat(eda): add clear_cache option to /eda endpoint (#233) * feat: legacy school inference DB job trigger (#212) * feat: custom school inference, but need to confirm if custom is the same as legacy * fix: transitioning from 'custom' to 'legacy' * fix: remove validation of job parameters, handled already through edvise * fix: run request still requires str values, defaulting to empty string * fix: still getting pydantic error * feat: using substring matching to find legacy job since i have it deployed under my name because of target==dev * fix: style * fix: style * fix: style * fix: making batch file name more robust so we don't run into decoding issues * fix: merge conflict * fix: merge conflict --------- Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * feat: add gen ai as 4th option in addition pdp / edvise / legacy in api and uploads (#244) * feat: Added "GenAI" as an option for "create institution" note: for GenAI raw files, we will reuse the same loose rules as Legacy institutions (read CSV, PII check, no strict ES columns). 
* fix: style --------- Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * fix(models): derive PDP batch schema configs from institution schemas (#247) * fix(models): derive PDP batch schema configs from institution schemas When model.schema_configs is null, PDP inference now builds a default required COURSE+STUDENT (etc.) batch rule from inst.schemas instead of 500ing. Explicit model configs still take precedence. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(models): cast jsonpickle decode for mypy no-any-return Co-authored-by: Cursor <cursoragent@cursor.com> * fix(databricks): prefer Cloud Run job when pipeline name is ambiguous When multiple dev bundle jobs match a PDP or legacy inference pipeline substring, resolve to [dev dev_cloudrun_sa] if present; otherwise use the first sorted match instead of failing the inference request. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> * ci: datakind shared workflows (#245) * ci: datakind shared workflows * refactor: rename test.yml -> tests.yml * fix(ci): add workflow_call to style and tests workflows * refactor: use pre-release workflow from shared workflows * ci: replace with shared enforce-pr-targets workflow Aligns checks against the current protected branches, main and develop, rather than staging * chore: remove unused workflow * refactor: remove pull_request triggers. 
These run via ci.yml * ci: pin tests and type-check to Python 3.13 * chore(ci): remove legacy webapp-and-worker precommit workflow * ci: standardize on Python 3.12 across workflows and pyproject * ci: test workflow enforcement * ci: test workflow enforcement * ci: add gate job to report required ci status check * chore: bump python version to 3.10 * chore: standardize Python 3.12 across project and Docker * chore: updating edvise v1.2.0 * chore: CHANGELOG.md update + type check * chore(release): bump version * ci(cloudbuild): parameterize webapp deploy for multi-environment triggers --------- Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com> Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu> --------- Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com> Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com> Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com> Co-authored-by: Mesh <meshach.ogunmodede@datakind.org> Co-authored-by: William Carr <bill.carr@datakind.org> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: William Carr <bill@datakind.org> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com> Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com> Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
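
For readers skimming the log above, the bronze sync behavior described in #239 (kill switch, PDP-only skip, non-fatal trigger failure, one trace log with an outcome) can be sketched roughly as follows. This is a minimal illustration and not the code in src/webapp/routers/data.py: the function names, the `run_now` callable, the `StandInDatabricksError` class, and the `reason` values are all hypothetical; only the environment variable, the event name, the outcome values, and the `validated/{file_name}` path come from the commit messages and PR description.

```python
import json
import logging
import os
import uuid
from typing import Callable, Optional

logger = logging.getLogger("bronze_sync_sketch")


class StandInDatabricksError(Exception):
    """Hypothetical stand-in for the SDK error the real code catches."""


def sync_enabled() -> bool:
    # Enabled by default; only an explicit "false" turns the trigger off.
    raw = os.environ.get("ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION", "true")
    return raw.strip().lower() != "false"


def trigger_bronze_sync(
    file_name: str,
    edvise_id: Optional[str],
    legacy_id: Optional[str],
    run_now: Callable[[str], int],
) -> dict:
    """Try to start the sync job; never raise, always return a trace dict."""
    trace = {
        "event": "gcs_bronze_sync_background_done",
        "correlation_id": uuid.uuid4().hex,
        "blob": f"validated/{file_name}",
    }
    if not sync_enabled():
        trace.update(outcome="skipped", reason="disabled_by_env")
    elif not (edvise_id or legacy_id):
        # PDP-only institutions do not get a bronze sync.
        trace.update(outcome="skipped", reason="pdp_only")
    else:
        try:
            # run_now is assumed to return a run id once the run is accepted.
            run_id = run_now(trace["blob"])
            trace.update(outcome="success", run_id=run_id)
        except (ValueError, StandInDatabricksError) as exc:
            # A failed trigger is logged but must not fail validation.
            trace.update(outcome="trigger_failed", error=str(exc))
    logger.info(json.dumps(trace))
    return trace
```

Because the function returns a trace instead of raising, a caller in the validation path can ignore the result and still return its normal validation response.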

* docs: inherit org community health files (#237) * docs: remove local community health files to inherit from org-wide .github repo * docs: update README to include previous contributing info * feat(api): simplify create model request to name only (#238) * chore: bump edvise dependency to 1.0.0 (#241) Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * chore: creating dummy changlog.md file while we create semver / gitflow process * feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) (#239) * Trigger GCS→bronze Databricks sync after validation (Edvise/Legacy) - Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync with include_blob_paths_json for validated/{file_name}. - Call after successful validate-upload / validate-sftp when edvise_id or legacy_id is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable. - Failures to start the job are logged and do not fail validation. - Extend data tests with DatabricksControl mock and assertions. Made-with: Cursor * feat(data): bronze sync after validation with tracing and job id resolution Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/ writes. Add correlation_id and JSON trace logs (validation_request, background start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by name with duplicate detection when unset. Refine skip reasons for PDP vs Edvise/Legacy. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(data): align bronze sync with universal principles Extract Databricks helpers and job-parameter constants, use specific exceptions (ValueError, DatabricksError), and split background logging into focused functions under 50 lines. Add tests for PDP-only and env kill-switch skips plus run_now parameter contract coverage. 
Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply ruff format to bronze sync modules Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): resolve prefixed bronze sync Databricks jobs Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): map bronze sync job ids by environment Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): trigger bronze sync during validation request Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat(api): validate Edvise uploads with repo schemas (#242) * feat(api): validate Edvise uploads with repo schemas Route Edvise student and course uploads through upstream edvise Pandera schemas so upload validation matches the pipeline contract and is not bypassed by registry schema drift. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(api): separate Edvise validation routing Keep PDP and Edvise repo-schema upload validation paths distinct so each helper has a single responsibility while preserving the same validation behavior. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(api): remove redundant repo validation fallback Keep JSON validation flow focused now that PDP and Edvise repo-schema uploads are routed before schema merging. Co-authored-by: Cursor <cursoragent@cursor.com> * style(api): format validation routing test Apply Ruff formatting to keep the Edvise validation routing tests passing style checks. 
Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat(webapp): establish pyproject.toml as canonical Edvise API version (#243) * feat(webapp): establish pyproject.toml as canonical Edvise API version, set version and title in OpenAPI * docs: rename SST -> Edvise * docs(webapp): update formatter instructions to ruff to align with github workflows and engineering playbook * test(webapp): assert OpenAPI version matches pyproject.toml * feat(eda): add clear_cache option to /eda endpoint (#233) * feat: legacy school inference DB job trigger (#212) * feat: custom school inference, but need to confirm if custom is the same as legacy * fix: transitioning from 'custom' to 'legacy' * fix: remove validation of job parameters, handled already through edvise * fix: run request still requires str values, defaulting to empty string * fix: still getting pydantic error * feat: using substring matching to find legacy job since i have it deployed under my name because of target==dev * fix: style * fix: style * fix: style * fix: making batch file name more robust so we don't run into decoding issues * fix: merge conflict * fix: merge conflict --------- Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * feat: add gen ai as 4th option in addition pdp / edvise / legacy in api and uploads (#244) * feat: Added "GenAI" as an option for "create institution" note: for GenAI raw files, we will reuse the same loose rules as Legacy institutions (read CSV, PII check, no strict ES columns). * fix: style --------- Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * fix(models): derive PDP batch schema configs from institution schemas (#247) * fix(models): derive PDP batch schema configs from institution schemas When model.schema_configs is null, PDP inference now builds a default required COURSE+STUDENT (etc.) batch rule from inst.schemas instead of 500ing. 
Explicit model configs still take precedence. Co-authored-by: Cursor <cursoragent@cursor.com> * fix(models): cast jsonpickle decode for mypy no-any-return Co-authored-by: Cursor <cursoragent@cursor.com> * fix(databricks): prefer Cloud Run job when pipeline name is ambiguous When multiple dev bundle jobs match a PDP or legacy inference pipeline substring, resolve to [dev dev_cloudrun_sa] if present; otherwise use the first sorted match instead of failing the inference request. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> * ci: datakind shared workflows (#245) * ci: datakind shared workflows * refactor: rename test.yml -> tests.yml * fix(ci): add workflow_call to style and tests workflows * refactor: use pre-release workflow from shared workflows * ci: replace with shared enforce-pr-targets workflow Aligns checks against the current protected branches, main and develop, rather than staging * chore: remove unused workflow * refactor: remove pull_request triggers. These run via ci.yml * ci: pin tests and type-check to Python 3.13 * chore(ci): remove legacy webapp-and-worker precommit workflow * ci: standardize on Python 3.12 across workflows and pyproject * ci: test workflow enforcement * ci: test workflow enforcement * ci: add gate job to report required ci status check * chore(release): sync develop with main (#251) * Merge pull request #217 from datakind/fix/pdp-course-handling-duplicates fix: fix duplicate-handling step in validation * fix(storage): reduce peak memory during upload validation - Download unvalidated blob to a temp file and validate by path instead of blob.open().read() via _path_for_edvise_read (avoids a full in-RAM copy). - Write validated CSV to a temp file and upload_from_filename instead of building the entire CSV in a StringIO string. Branched from develop (repo has no dev branch). 
Made-with: Cursor * chore(storage): log errno on temp download/to_csv OSError Helps distinguish ENOSPC vs other failures in Cloud Run logs; re-raises unchanged. Made-with: Cursor * test(storage): cover temp cleanup and OSError logging for validate upload - Download OSError: unlink temp, skip validate_file_reader, log errno - to_csv OSError: unlink temp, no upload, log errno - Upload failure after to_csv: temp still unlinked Made-with: Cursor * refactor(storage): extract temp download/unlink helpers for clarity Aligns with universal-principles: keep _run_validation_and_get_normalized_df under 50 lines, reduce nesting, replace tmp_path with local_csv_path naming. Made-with: Cursor * style: apply black/ruff format to gcsutil_test.py Made-with: Cursor * feat: consolidating staging into main and using main going forward as production (#234) * Feat: Added backfill endpoint * Fix: linting * added func description * added func description * added func description * added func description * added func description * added func description * added func description * feat: adjusted run output endpointto return model_run_id * Delete .DS_Store * Delete src/.DS_Store * Delete terraform/.DS_Store * feat: added model deletion endpoint * feat: added model deletion endpoint * feat: added model deletion endpoint * fix: linting * fix: linting * fix: linting * fix: linting * fix: linting * fix: linting * fix: linting * fixed model name malformation * fix: removed databricks deletion functionality * fix: removed query results not needed * fix: removed query results not needed * fix: added status * fix: added status * fix: formatting fix * fix: added query to retrieve model id * fix: added passive delete to db cascade so deleting the model ensures job runs are deleted * fix: removed extra db query for model id, since db now handles passive deletes * fix: formatting fix * fix: removed db mapping framework * fix: removed db mapping framework * fix: removed db mapping framework * fix: 
removed db mapping framework * feat: changed endpoint parameter name from experiment_run_id -> model_run_id * fix: type check errors * test batch and file data * eda endpoints * test data * eda calculations * eda year and term, course enrollemnts * eda degree types * fix: divide data category into a seperate front end table section * fix: linting * feat: developed function for adding custom jobs with institution and model validation * fix: linting errors * fix: linting errors * fix: changed route from GET to POST * fix: added output filename definition * fix: linting errors * eda test institution data * eda test institution * eda data * eda test data * allow missing eda data * eda enrollment type by intensity * eda pell recipient by 1st gen * eda student age by gender * eda pell status by race * eda tests * cache eda * tidy up * remove LOCAL test bucket setup * return List from get_term_counts * import pandas * remove unused variable * tidy up * eda bucket names * fix: type check errors * fix: type check errors * fix: type check errors * fix: formatting errors * fix: type check errors * fix: type check errors * fix: batch name renewal * fix: batch name renewal * fix: changed output_valid to true * fix: adjusted model card file path * fix: ensuring we are grabbing the most recent run for a model id * remove colors from /eda endpoint * return count and percentage in /eda degree_types * tidy up * fix: fix file format * fix: retrieve by model_run_id instead * fix: formatting * fix: validation error for worwic * fix: changed model name to model_run_id parameter * fix: added function to retrieve config.toml from select catalog * manually initialized course mappings * feat: added validation mapping * fix: formatting * fix: pylint * Ignore .cursor folder for personal cursor preferences * feat(schema): add Edvise schema definition * feat(institutions): add Edvise schema support Add Edvise schema support to institution management: - Add edvise_id field to InstTable and 
SchemaRegistryTable - Update create/update endpoints with Edvise support and validation - Add mutual exclusivity check (PDP vs Edvise) - Implement normalization for empty strings and whitespace - Remove redundant boolean flags (derive status from ID presence) - Add comprehensive test coverage (34 new test cases) All changes are backward compatible. * fix: resolve CI/CD test failures - Fix test_create_inst_with_edvise_success: use unique institution name to avoid UNIQUE constraint violation - Fix test_trigger_inference_run: add pdp_id to InstTable fixture in models_test.py - Fix code formatting: run ruff format on database.py, institutions.py, and institutions_test.py These fixes address the three issues that were causing CI/CD test failures: 1. UNIQUE constraint failed: inst.name in test_create_inst_with_edvise_success 2. Assertion error: expected 400 but got 501 in test_trigger_inference_run 3. Ruff format check failures * fix: resolve unique constraint conflicts in SchemaRegistryTable - Add doc_type to is_pdp and is_edvise unique constraints to allow base, PDP, and Edvise schemas to coexist with same version - Add CheckConstraint to enforce mutual exclusivity of is_pdp and is_edvise flags Fixes Bugbot issue: Unique constraint prevented coexisting schema types for same version. The original constraints (is_pdp, version_label) and (is_edvise, version_label) prevented base schema and PDP/Edvise extensions from sharing the same version label since they all had is_pdp=False and is_edvise=False. Adding doc_type to these constraints allows proper coexistence while maintaining uniqueness guarantees. Also adds database-level enforcement that is_pdp and is_edvise cannot both be True simultaneously. 
* fix: resolve mypy type errors - Fix type error in institutions.py: change set to list for requested_schemas default value - Add return type annotations to all test functions in institutions_test.py - Add return type annotations to fixture functions - Add typing.Any import for fixture return types Fixes mypy errors: incompatible types in assignment and missing return type annotations. * fix: add missing type annotations to test function parameters - Add TestClient type annotations to test_create_inst_unauth, test_create_inst, test_edit_inst, and test_delete_inst Fixes mypy errors: Function is missing a type annotation for one or more arguments. * feat: Implement Phase 3 Edvise schema validation logic - Add EDVISE_SCHEMA_GROUP constant to utilities.py (mirrors PDP_SCHEMA_GROUP) - Add _edvise_cache to _ValidationState class for schema caching with TTL - Update validation_helper() to load Edvise schema when edvise_id is set - Add defensive check for mutual exclusivity (pdp_id and edvise_id cannot both be set) - Add error handling for missing Edvise schema with clear error messages - Update institution creation endpoint to use EDVISE_SCHEMA_GROUP when edvise_id is provided - Add comprehensive test suite: 15 tests covering happy path, errors, cache, authorization, and edge cases This implementation enables institutions with edvise_id to use the Edvise schema extension for file validation, following the same pattern as PDP schema validation. All changes are backwards compatible and include comprehensive test coverage (~90% of critical paths). 
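
The Phase 3 commit above mentions an `_edvise_cache` on `_ValidationState` for schema caching with a TTL. A minimal sketch of that caching idea, assuming a single cached value and an injectable clock, might look like this. The class name `TTLCache`, its `get` method, and the `load_schema` loader are hypothetical; the real `_ValidationState` may be structured quite differently.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Hold one loaded value and reload it once it is older than ttl_seconds."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._value: Optional[Any] = None
        self._loaded_at: Optional[float] = None

    def get(self, loader: Callable[[], Any]) -> Any:
        """Return the cached value, calling loader() only when it is missing or stale."""
        now = self._clock()
        expired = self._loaded_at is None or (now - self._loaded_at) >= self._ttl
        if expired:
            self._value = loader()
            self._loaded_at = now
        return self._value


def load_schema() -> dict:
    """Stand-in for the (hypothetical) database lookup of the Edvise schema."""
    return {"schema_group": "edvise"}


schema_cache = TTLCache(ttl_seconds=300)
schema = schema_cache.get(load_schema)
```

Passing the clock in makes expiry testable without sleeping, which fits the later commit that simplifies the cache tests to verify behavior instead of implementation.
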
* fix: Resolve Edvise test failures and improve test reliability - Fix type annotation error in PDP schema branch (mypy no-redef) - Change test user to DATAKINDER for multi-institution access - Fix database constraint violation in precedence test (version_label) - Simplify cache tests to verify behavior instead of implementation - Remove duplicate assertion in cache expiration test - Optimize imports in test fixture * fix: Update Edvise test filenames to include descriptive keywords - Change generic test filenames (test.csv, test_file.csv, etc.) to include 'student' keyword - This allows validation_helper to properly infer model types from filenames - Fixes ValueError: Could not infer model(s) from file name errors - Formatting will be applied by CI ruff formatter * style: Format data_test.py with ruff * fix(validation): return proper HTTP status codes for institution errors - Change ValueError to HTTPException (404) when institution not found in validation_helper - Fix test_validate_edvise_unauthorized to test actual unauthorized access instead of non-existent institution - Ensures proper HTTP status codes are returned to API clients * fix: handle filename inference errors and extension schema deactivation - Replace ValueError with HTTPException (422) for filename inference failures to return proper user-facing error instead of 500 - Deactivate existing extension schemas before inserting new ones to ensure only one active extension per institution and prevent nondeterministic queries - Add comprehensive validation error formatter with PII masking and user-friendly messages - Add integration and snapshot tests for error formatter * fix: remove unused imports from validation_error_formatter_snapshot_test - Remove unused typing imports (Any, Dict, List) - Remove unused pandera imports (DataFrameSchema, Column, Check) - Remove unused MAX_ERROR_EXAMPLES import Fixes ruff linting errors (F401) reported in CI. 
* fix: resolve test failures and configuration issues - Remove invalid catalog_name parameter from create_custom_schema_extension call - Restore testpaths configuration to use src directory - Add Pandera FutureWarning filter to pytest config - Fix syntax warning in databricks.py docstring - Format files with Ruff * fix: resolve Ruff and Mypy linting errors - Remove unused imports (IO, cast, tomli/tomllib) from databricks.py - Remove duplicate import re statement - Add type annotations to test cases in validation_error_formatter_test.py - Add type: ignore comments for intentional invalid type tests * fix: align database constraints with production schema and fix Edvise version_label collision - Fix uq_pdp_version constraint to match production: remove doc_type (matches actual DB schema) - Remove uq_edvise_version constraint (enforced operationally, not via DB constraint) - Update CHECK constraint to use MySQL-compatible boolean values (1/0 instead of TRUE/FALSE) - Fix Edvise test fixture to use version_label='edvise-1.0.0' to avoid uq_pdp_version collision - Add explanatory comment about version_label choice in test fixture These changes ensure the ORM matches the actual production database schema and prevent constraint violations when running tests against MySQL. * fix: handle parameterized Pandera check types in validation error formatting Fix bug where parameterized check types (e.g., "isin(['A', 'B', 'C'])", "str_length(3, None)") were not being matched to their formatters, causing generic error messages instead of human-readable ones. 
Changes: - Add _extract_base_check_type() to extract base type from parameterized check types (e.g., "isin(['A', 'B'])" -> "isin") - Add _normalize_check_type_alias() to map verbose Pandera names to spec keys (e.g., "greater_than" -> "gt", "greater_than_or_equal_to" -> "ge") - Update _find_check_spec() to use base type extraction and alias normalization - Update _format_check_error() to only format when matching spec is found (prevents semantic errors like formatting "greater_than" as "ge") - Add _format_gt_error() and _format_lt_error() for strict comparison checks - Preserve semantic correctness: strict comparisons (> and <) vs non-strict (≥ and ≤) Edge cases handled: - Namespaced types: "Check.isin(['A'])" -> "isin" - Empty/None/non-string inputs: returns safe empty string - Spaces around parentheses: "isin (['A'])" -> "isin" - Complex repr: "str_matches(re.compile('...'))" -> "str_matches" Testing: - Add comprehensive unit tests for base type extraction and alias handling - Add tests for parameterized check types (isin, str_length, gt, ge) - Update integration test assertion to match actual output format - Update snapshot fixtures to reflect new human-readable messages Fixes parameterized check type matching while maintaining semantic correctness for strict vs non-strict comparisons. * style: format validation_error_formatter files with ruff Auto-formatted files to comply with project formatting standards. 
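
The parameterized check-type commit above describes the intended input/output pairs for `_extract_base_check_type()` and `_normalize_check_type_alias()` but not their implementation. One way to get that behavior is sketched below, assuming a single regular expression is enough. The regex and the alias table are assumptions; the alias table contains only the two mappings named in the commit message.

```python
import re
from typing import Any

# Assumed alias table: only the two mappings named in the commit message.
_ALIASES = {"greater_than": "gt", "greater_than_or_equal_to": "ge"}

# Optional dotted namespace ("Check."), then the base name, then "(" or end of string.
_BASE_RE = re.compile(r"^\s*(?:\w+\.)*(\w+)\s*(?:\(|$)")


def extract_base_check_type(check: Any) -> str:
    """Return the base check name, or an empty string for unusable input."""
    if not isinstance(check, str):
        return ""
    match = _BASE_RE.match(check)
    return match.group(1) if match else ""


def normalize_check_type_alias(base: str) -> str:
    """Map verbose names to spec keys; unknown names pass through unchanged."""
    return _ALIASES.get(base, base)


examples = [
    "isin(['A', 'B'])",
    "Check.isin(['A'])",
    "isin (['A'])",
    "str_matches(re.compile('...'))",
    "greater_than(0)",
]
resolved = [normalize_check_type_alias(extract_base_check_type(e)) for e in examples]
```

Keeping extraction and alias normalization as separate steps mirrors the commit's concern about semantic correctness: a strict `greater_than` resolves to `gt` and is never folded into `ge`.
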
* feat: add case-insensitive institution name lookup - Implement case-insensitive matching for GET /institutions/name/{inst_name} endpoint - Use func.lower() on both database column and input parameter for case-insensitive comparison - Update docstring to document case-insensitive behavior and error handling - Add comprehensive test cases for case-insensitive matching: - Test multiple case variations (original, title case, uppercase, mixed case) - Test lowercase input matching database entries - Test uppercase input matching lowercase database entries - Fix type error: change requested_schemas assignment from set to list for type consistency - Apply code formatting with ruff * fix: add missing return type annotations to test functions - Add Generator import from typing for fixture return types - Add return type annotations (-> None) to all test functions: - test_read_all_inst - test_read_all_inst_datakinder - test_read_inst_by_name - test_read_inst_by_name_case_insensitive - test_read_inst_by_name_case_insensitive_lowercase - test_read_inst_by_name_case_insensitive_uppercase - test_read_inst_by_pdp_id - test_read_inst - Fix fixture return types to use Generator[TestClient, None, None] - client_fixture - datakinder_client_fixture - Resolves mypy type checking errors for test file * style: apply ruff formatting to test file - Split long function signatures across multiple lines for readability - Format client_fixture and datakinder_client_fixture function signatures - Format test_read_inst_by_name_case_insensitive_lowercase and _uppercase function signatures * fix(test): update institutions test for edvise_id API changes - Remove unused typing.Any import - Update test_read_all_inst_datakinder to include edvise_id in expected response - Add edvise_test_school institution to expected response (4 institutions total) - Fix line length for pylint compliance This fixes test failures caused by API changes from develop branch that now return edvise_id and pdp_id fields for 
all institutions. * fix(validation): pass institution_id so Edvise/PDP/custom use correct extension block - Thread schema_namespace (edvise | pdp | inst UUID) from data router through validate_file and validate_file_reader into validate_dataset - merge_model_columns now receives correct key for extension_schema['institutions'] - Add institution_id param with default 'pdp' for backward compatibility - Add tests: assert Edvise validation passes institution_id='edvise'; add unit test that institution_id selects the right extension block (edvise vs pdp) - Expand docstrings (Args/Returns) and add comment explaining schema_namespace - Addresses reviewer Q1: schema extension logic now works for Edvise and custom institutions, not only PDP * Apply Black formatting to institutions_test.py * Apply ruff format to institutions_test.py * Fix institutions_test assert for Black and Ruff format compatibility * Fix pylint E1135 in data_test: use .get() instead of membership test on captured_schema * Apply ruff format to data_test.py * feat(validation): schema validation during upload with PDP/edvise repo alignment - Add PDP edvise schema validation path (validation_pdp_edvise) - Add Edvise-to-PDP normalization (validation_edvise_normalize) - Integrate repo schemas into validation pipeline and error formatter - Update pdp_schema_extension and lockfile; add tests Co-authored-by: Cursor <cursoragent@cursor.com> * feat(validation): write normalized data to validated/, archive raw to raw/ - On validation success: archive original to raw/{filename}, write normalized (canonical columns, coerced dtypes) DataFrame to validated/{filename}, delete from unvalidated/ - Validation layer always returns normalized_df on success; storage serializes to UTF-8 CSV and uploads to validated/ - Add input validation and helpers in gcsutil (under 50 lines); catch specific exceptions; TYPE_CHECKING for HardValidationError in validation_pdp_edvise - Add gcsutil_test.py: validate_file input/error/success 
paths, _run_validation_and_get_normalized_df, _write_dataframe_to_gcs_as_csv - Add validation_test: empty-schema short-circuit returns normalized_df None - Ruff/black formatting and lint fixes; mypy-clean for touched files Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(validation): align with universal principles, add tests, fix types and format - Extract validation helpers to meet 50-line rule (_header_missing_and_extra, _get_csv_read_kwargs, _validate_optional_columns_json) - Extract gcsutil._archive_raw_and_write_validated; add type hints to rename_file - Add tests: PDP rename/validate_dataframe, CSV read failure, gcsutil error propagation, edvise institution_identifier in validate_file call - Remove unused validation_edvise_normalize and its tests - Fix mypy in validation_pdp_edvise and tests (Optional[List], cast, annotations) - Apply ruff format Co-authored-by: Cursor <cursoragent@cursor.com> * feat(validation): use edvise read for PDP uploads and add PDP path tests - Route PDP cohort/course through edvise read (read_raw_pdp_*); remove API-side normalizers for PDP so pipeline and API share one source of truth - Add _path_for_edvise_read, _read_pdp_course_edvise, _validate_pdp_with_edvise_read - Convert Pandera SchemaErrors to HardValidationError in PDP path - Add validation_pdp_read_path_test.py (routing, path cleanup, SchemaErrors, course converter fallback); extend Src type with io.StringIO for file-like Co-authored-by: Cursor <cursoragent@cursor.com> * move cloud build config to repo * sst-app-api -> edvise-api * quiet down sqlalchemy * use EdaSummary from edvise * use ruff formatter * test a file * tidy up * Add return type annotations for mypy in main_test and users_test * tidy up * move cache check after batch result check * fix test_execute_pdp_pull * install git * install git in correct Dockerfile * install git in worker * update edvise branch * use develop branch for edvise * install edvise in build * cloudbuild with edvise * 
fix(validation): resolve pylint used-before-assignment error Initialize schema_err_to_raise before try block to satisfy pylint's static analysis, which doesn't recognize that pytest.skip() always raises. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(api): add legacy school type with any-format uploads - Add legacy_id to InstTable and institution API (create, update, read) - Enforce mutual exclusivity of pdp_id, edvise_id, legacy_id via has_at_most_one_school_type - Legacy validation: encoding + CSV read only, no schema checks - Add LEGACY_SCHEMA_GROUP and tests for legacy path and mutual exclusivity Made-with: Cursor * feat(api): legacy PII check, principles compliance, and test coverage - Add PII column check for legacy uploads; reject before raw/validated - Treat student_id as non-PII (false positive) for all institution types - Comply with universal principles: docstrings, extract create_institution helpers (<50 lines), comment lazy import in validation - Add tests: has_at_most_one_school_type, legacy header-only CSV, legacy PII rejection returns 400, explicit legacy_id create, update add legacy_id, storage/Databricks failure paths - Fix mypy in create_institution (row variable) Made-with: Cursor * docs(api): use Edvise Schema (ES) naming to reduce confusion Replace 'Edvise schema' with 'Edvise Schema (ES)' in docstrings, comments, and user-facing error messages so the schema type is distinguished from the Edvise product (ES convention). 
Made-with: Cursor * feat(data): allow legacy institutions to upload files with any filename - Fetch institution before filename inference; set allowed_schemas to UNKNOWN when inference fails for legacy (non-legacy still get 422 for non-descriptive names) - Refactor validation_helper into helpers under 50 lines; add full docstrings, early empty-filename and invalid inst_id validation, log before 404 - Add unit tests for _infer_allowed_schemas_from_filename and _ext_models_set - Add integration tests: empty filename 422, invalid inst_id 404, edvise non-descriptive filename 422, duplicate validate idempotent - Fix mypy and ruff/black in data.py and data_test.py - Add PR_DESCRIPTION.md for feature branch Made-with: Cursor * chore: remove PR_DESCRIPTION.md Made-with: Cursor * fix(validation): run PII check for header-only legacy CSVs * fix(test): align validation error snapshot with non-PII student_id display Made-with: Cursor * feat(validation): use PDP cohort converter and support custom converters - Use converter_func_cohort by default for PDP cohort validation (filters DE/DS/SE) - Add optional pdp_cohort_converter_func and pdp_course_converter_func to validate_file_reader and validate_dataset for school-specific overrides - Course validation tries custom converter first, then default handling_duplicates - Validate converter args are callable; convert converter/read failures to HardValidationError so API returns 400 with context - Add PDPConverterFunc type; extract helpers to meet 50-line and error-handling rules Made-with: Cursor * fix(validation): satisfy mypy for PDP validation and tests - Add unreachable return after with block in _validate_pdp_with_edvise_read - Use cast(Any, ...) 
in tests that pass non-callables to converter params Made-with: Cursor * chore: remove real institution names * chore: ruff format * fix: use latest edvise EdaSummary * fix: use edvise develop branch * chore(deps): pin edvise to develop * feat(ci): notify slack channel on deployment * fix: lock file was out of sync * chore: bump edvise version to 0.1.12 * Revert "feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming" * Revert "Revert "feat: legacy school type with any-format uploads, PII check, and Edvise Schema (ES) naming"" * feat(config): add optional local inst/batch/file seed from config for LOCAL * style: ruff format * fix(validation): pass schema_type to handling_duplicates for PDP course CSV read_raw_pdp_course_data calls converter_func(df) with one argument; bare handling_duplicates is invalid on current edvise. Use a wrapper that calls handling_duplicates(df, "pdp") positionally for edvise compatibility. Remove the broken second default converter. Update PDP read path test. Made-with: Cursor * style: ruff format PDP course read path test Made-with: Cursor * fix(deps): upgrade databricks-sql-connector for pyarrow>=17 (edvise) databricks-sql-connector 3.5 pins pyarrow<17; edvise requires pyarrow>=17. Use databricks-sql-connector[pyarrow]~=4.2.x and refresh uv.lock (pyarrow 19). Aligns lock with Cloud Build 'uv lock --upgrade-package edvise'. Made-with: Cursor * Merge pull request #217 from datakind/fix/pdp-course-handling-duplicates fix: fix duplicate-handling step in validation * fix(storage): reduce peak memory during upload validation - Download unvalidated blob to a temp file and validate by path instead of blob.open().read() via _path_for_edvise_read (avoids a full in-RAM copy). - Write validated CSV to a temp file and upload_from_filename instead of building the entire CSV in a StringIO string. Branched from develop (repo has no dev branch). 
Made-with: Cursor * chore(storage): log errno on temp download/to_csv OSError Helps distinguish ENOSPC vs other failures in Cloud Run logs; re-raises unchanged. Made-with: Cursor * test(storage): cover temp cleanup and OSError logging for validate upload - Download OSError: unlink temp, skip validate_file_reader, log errno - to_csv OSError: unlink temp, no upload, log errno - Upload failure after to_csv: temp still unlinked Made-with: Cursor * refactor(storage): extract temp download/unlink helpers for clarity Aligns with universal-principles: keep _run_validation_and_get_normalized_df under 50 lines, reduce nesting, replace tmp_path with local_csv_path naming. Made-with: Cursor * style: apply black/ruff format to gcsutil_test.py Made-with: Cursor
* chore: bump edvise v0.2.0 * fix(pdp-validation): default cohort converter to none Stop passing edvise converter_func_cohort when pdp_cohort_converter_func is omitted so PDP cohort rows are validated as read. - Callers may still pass an explicit cohort converter. - Update PDP read-path test to expect converter_func=None. - Refresh docstrings (pipeline vs API, Args/Returns/Raises) in validation and validation_pdp_edvise. Made-with: Cursor * feat(api): remove custom institution path; require school type; legacy schemas UNKNOWN - Require exactly one of PDP, Edvise, or Legacy on POST /institutions - Remove custom schema resolution and Databricks extension generation for uploads - Fix PATCH /institutions to persist allowed_schemas to inst.schemas column - LEGACY_SCHEMA_GROUP stores UNKNOWN only; drop validation_extension module - Update tests and default fixtures for typeless/custom removal Made-with: Cursor * feat(api): harden institutions API after custom-institution removal - POST/PATCH: require exactly one school type (pdp, edvise, or legacy) - PATCH: recompute schemas only when the type triple changes; merge optional allowed_schemas on change - PATCH: honor is_edvise/is_legacy for auto-assigned ids (POST parity) - Docs/tests: validation namespaces; disambiguate custom naming in code and tests Made-with: Cursor * docs(api): revert broad custom wording; keep upload docs accurate Restore original docstrings and test names where "custom" referred to converters, schema config, or JSON keys, not custom institutions. Keep gcsutil validate_file institution_id line aligned with pdp/edvise/legacy only (no institution-UUID-for-custom upload path).
Made-with: Cursor * fix(institutions): reject POST duplicate when existing row lacks school type When (name, state) matches an existing InstTable row, validate stored pdp_id/edvise_id/legacy_id the same as new creates: at most one non-null and exactly one required. Return 400 with guidance instead of 200 for typeless or invalid rows. Add regression tests. Made-with: Cursor * test(institutions): cover duplicate POST, PATCH flags, allowed_schemas-only - Reject is_pdp without pdp_id on POST - Reject duplicate (name, state) when stored row has conflicting ids - Reject PATCH is_edvise on PDP row without clearing pdp_id - Reject PATCH with both is_edvise and is_legacy - allowed_schemas-only PATCH replaces schemas when type unchanged Made-with: Cursor * refactor(institutions): extract PATCH helpers and DRY school-type errors - Add shared mutual-exclusion detail constant for POST/PATCH paths - Extract duplicate-post row validation and PATCH merge/validate/persist helpers - Keep update_inst within single-responsibility helpers; reuse row response mapper Made-with: Cursor * fix(lint): satisfy ruff and mypy on databricks and institutions - Remove unused HTTPException import from databricks.py (F401) - Cast ORM row in _require_single_institution_row_by_uuid for InstTable (no-any-return) Made-with: Cursor * style(institutions): apply ruff format to router and tests Made-with: Cursor * refactor: simplify local_inst_data * docs: Update local_inst_data instructions * chore: remove unused import * fix: make pdp_id and state optional * chore: bumping pyproject and uv.lock --------- Co-authored-by: Mesh <meshach.ogunmodede@datakind.org> Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com> Co-authored-by: William Carr <bill.carr@datakind.org> Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com> Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: William
Carr <bill@datakind.org> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com> * Revert "feat: consolidating staging into main and using main going forward as…" (#236) This reverts commit 9b70f238333796f7d9835d5ac5e1c81ee66d11c6. * Merge develop into main (#240) * docs: inherit org community health files (#237) * docs: remove local community health files to inherit from org-wide .github repo * docs: update README to include previous contributing info * feat(api): simplify create model request to name only (#238) * chore: bump edvise dependency to 1.0.0 (#241) Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * chore: creating dummy changlog.md file while we create semver / gitflow process --------- Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com> Co-authored-by: William Carr <bill@datakind.org> Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * chore(release): edvise-api 1.0.0 (#249) * docs: inherit org community health files (#237) * docs: remove local community health files to inherit from org-wide .github repo * docs: update README to include previous contributing info * feat(api): simplify create model request to name only (#238) * chore: bump edvise dependency to 1.0.0 (#241) Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * chore: creating dummy changlog.md file while we create semver / gitflow process * feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) (#239) * Trigger GCS→bronze Databricks sync after validation (Edvise/Legacy) - Add run_validated_gcs_to_bronze_sync and job edvise_validated_gcs_to_bronze_sync with include_blob_paths_json for validated/{file_name}. - Call after successful validate-upload / validate-sftp when edvise_id or legacy_id is set; ENABLE_GCS_BRONZE_SYNC_ON_VALIDATION (default true) to disable. - Failures to start the job are logged and do not fail validation. 
- Extend data tests with DatabricksControl mock and assertions. Made-with: Cursor * feat(data): bronze sync after validation with tracing and job id resolution Schedule GCS-to-bronze Databricks run_now in BackgroundTasks after validated/ writes. Add correlation_id and JSON trace logs (validation_request, background start/done). Optional DATABRICKS_VALIDATED_BRONZE_SYNC_JOB_ID; resolve job by name with duplicate detection when unset. Refine skip reasons for PDP vs Edvise/Legacy. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(data): align bronze sync with universal principles Extract Databricks helpers and job-parameter constants, use specific exceptions (ValueError, DatabricksError), and split background logging into focused functions under 50 lines. Add tests for PDP-only and env kill-switch skips plus run_now parameter contract coverage. Co-authored-by: Cursor <cursoragent@cursor.com> * style: apply ruff format to bronze sync modules Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): resolve prefixed bronze sync Databricks jobs Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): map bronze sync job ids by environment Co-authored-by: Cursor <cursoragent@cursor.com> * fix(data): trigger bronze sync during validation request Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat(api): validate Edvise uploads with repo schemas (#242) * feat(api): validate Edvise uploads with repo schemas Route Edvise student and course uploads through upstream edvise Pandera schemas so upload validation matches the pipeline contract and is not bypassed by registry schema drift. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(api): separate Edvise validation routing Keep PDP and Edvise repo-schema upload validation paths distinct so each helper has a single responsibility while preserving the same validation behavior. 
Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(api): remove redundant repo validation fallback Keep JSON validation flow focused now that PDP and Edvise repo-schema uploads are routed before schema merging. Co-authored-by: Cursor <cursoragent@cursor.com> * style(api): format validation routing test Apply Ruff formatting to keep the Edvise validation routing tests passing style checks. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * feat(webapp): establish pyproject.toml as canonical Edvise API version (#243) * feat(webapp): establish pyproject.toml as canonical Edvise API version, set version and title in OpenAPI * docs: rename SST -> Edvise * docs(webapp): update formatter instructions to ruff to align with github workflows and engineering playbook * test(webapp): assert OpenAPI version matches pyproject.toml * feat(eda): add clear_cache option to /eda endpoint (#233) * feat: legacy school inference DB job trigger (#212) * feat: custom school inference, but need to confirm if custom is the same as legacy * fix: transitioning from 'custom' to 'legacy' * fix: remove validation of job parameters, handled already through edvise * fix: run request still requires str values, defaulting to empty string * fix: still getting pydantic error * feat: using substring matching to find legacy job since i have it deployed under my name because of target==dev * fix: style * fix: style * fix: style * fix: making batch file name more robust so we don't run into decoding issues * fix: merge conflict * fix: merge conflict --------- Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com> * feat: add gen ai as 4th option in addition pdp / edvise / legacy in api and uploads (#244) * feat: Added "GenAI" as an option for "create institution" note: for GenAI raw files, we will reuse the same loose rules as Legacy institutions (read CSV, PII check, no strict ES columns). 
* fix: style
---------
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
* fix(models): derive PDP batch schema configs from institution schemas (#247)
* fix(models): derive PDP batch schema configs from institution schemas
  When model.schema_configs is null, PDP inference now builds a default required COURSE+STUDENT (etc.) batch rule from inst.schemas instead of 500ing. Explicit model configs still take precedence.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(models): cast jsonpickle decode for mypy no-any-return
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(databricks): prefer Cloud Run job when pipeline name is ambiguous
  When multiple dev bundle jobs match a PDP or legacy inference pipeline substring, resolve to [dev dev_cloudrun_sa] if present; otherwise use the first sorted match instead of failing the inference request.
  Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* ci: datakind shared workflows (#245)
* ci: datakind shared workflows
* refactor: rename test.yml -> tests.yml
* fix(ci): add workflow_call to style and tests workflows
* refactor: use pre-release workflow from shared workflows
* ci: replace with shared enforce-pr-targets workflow
  Aligns checks against the current protected branches, main and develop, rather than staging
* chore: remove unused workflow
* refactor: remove pull_request triggers.
  These run via ci.yml
* ci: pin tests and type-check to Python 3.13
* chore(ci): remove legacy webapp-and-worker precommit workflow
* ci: standardize on Python 3.12 across workflows and pyproject
* ci: test workflow enforcement
* ci: test workflow enforcement
* ci: add gate job to report required ci status check
* chore: bump python version to 3.10
* chore: standardize Python 3.12 across project and Docker
* chore: updating edvise v1.2.0
* chore: CHANGELOG.md update + type check
* chore(release): bump version
* ci(cloudbuild): parameterize webapp deploy for multi-environment triggers
---------
Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
---------
Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com>
Co-authored-by: Mesh <meshach.ogunmodede@datakind.org>
Co-authored-by: William Carr <bill.carr@datakind.org>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: William Carr <bill@datakind.org>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com>
Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
* refactor(validation): remove API JSON schema validation from upload pipeline (#246)
* refactor(validation): remove API JSON schema validation
  Route upload validation through institution namespaces and the edvise repo Pandera schemas instead of API-local JSON schema documents.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* test(validation): update upload validation coverage
  Cover repo-backed PDP and Edvise upload validation, legacy handling, and unsupported model errors after removing the JSON fallback.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(data): repair GenAI upload validation and enable bronze sync
  Correct a merge regression where legacy/GenAI institutions returned a tuple from _resolve_schema_namespace, include GenAI in GCS→bronze sync, and add upload validation test coverage for GenAI schools.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* ci: temporarily use new asana shared workflow
* ci: use shared asana task link from @main branch
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com>
Co-authored-by: William Carr <bill@datakind.org>
* feat: hook up Edvise Schema (ES) inference to api (#253)
* feat: hook up Edvise Schema (ES) inference to api, so we can run it from webapp like PDP; and enhance institution type handling
  - Updated `trigger_inference_run` to include support for Edvise Schema (ES) and GenAI alongside existing PDP and Legacy types.
  - Enhanced mutual exclusivity check to include `genai_id`.
  - Introduced `run_es_inference` method in `DatabricksControl` for triggering ES inference jobs.
  - Updated error messages and validation checks to reflect the new schema options.
  - Added tests to ensure proper handling of Edvise institutions and inference logic.
* fix: style
* feat: renaming "legacy_model_result" to just "model_result" to encompass both legacy and edvise school results
* feat: renaming from "DatabricksLegacyInferenceRunRequest" to "DatabricksSharedInferenceRunRequest" also renamed model_result to shared_model_result for consistency
* fix: removing part of comment that's irrelevant
* fix: lint
* fix: removing genai from error message
* feat: creating `is_genai_institution` parameter to feed into genai/edvise inference job for SSoT (#254)
* feat: using `batch_id` parameter for subfolder naming convention during GCS to DB bronze async job (#256)
* feat: use batch parameters for run-inference endpoint for genAI/Edvise/Legacy schools (#257)
* docs(db): add staging-verified shared schema contract for UI and API (#258)
* docs: add shared database schema contract for UI and API tables
  Publish canonical ownership and column definitions for users and job plus inventory of UI-only and API-only tables to support Phase 0 migration split work.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* docs: verify schema contract against staging all_tables DDL
  Align users and job canonical columns with staging SHOW CREATE TABLE exports (2026-06-24). Document FK on users.inst_id, skip job ALTER on staging, and exclude backup tables from Alembic scope.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* docs: document shared users and job tables in API README (#259)
  Link DB_SCHEMA_CONTRACT.md and document migration ownership plus greenfield bootstrap order for the shared Cloud SQL database.
  Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(api): expose model_run_id and model_version on RunInfo endpoints (#260)
* docs: add shared database schema contract for UI and API tables
  Publish canonical ownership and column definitions for users and job plus inventory of UI-only and API-only tables to support Phase 0 migration split work.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* docs: verify schema contract against staging all_tables DDL
  Align users and job canonical columns with staging SHOW CREATE TABLE exports (2026-06-24). Document FK on users.inst_id, skip job ALTER on staging, and exclude backup tables from Alembic scope.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* docs: document shared users and job tables in API README
  Link DB_SCHEMA_CONTRACT.md and document migration ownership plus greenfield bootstrap order for the shared Cloud SQL database.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(api): expose model_run_id and model_version on RunInfo endpoints
  Include training run identifiers on list-runs and single-run responses so the UI can stop falling back to direct job table reads.
  Co-authored-by: Cursor <cursoragent@cursor.com>
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(api): mirror accepted_terms and invite_validated on AccountTable (#263)
  Keep users ORM aligned with Laravel migrations for shared-table contract compliance; no DDL change (columns owned by edvise-ui).
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: sync ES pipeline rename & allow greater flexibility with legacy/GenAI/ES uploads (#262)
* fix: rename of es pipeline
* fix: schema fallback for genai & legacy institutions
* fix: allow for any non-empty batch regardless of per-file schema tags
* fix: adding a few PII false positive patterns
* fix: make PII more flexible and stop with the false positives
---------
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
* fix(api): coerce Databricks model version to str for RunInfo responses (#264)
  PR #260 added model_version as a string on RunInfo, but Databricks returns version as int, causing ResponseValidationError on run-inference for all school types.
  Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
  Co-authored-by: Cursor <cursoragent@cursor.com>
* chore(release): bump version
* chore(release): update CHANGELOG
* chore: bump edvise from 1.2.0 to 1.4.0
---------
Co-authored-by: Rachel Wells <rachellaurynwells@gmail.com>
Co-authored-by: Vishakh Pillai <64162993+vishpillai123@users.noreply.github.com>
Co-authored-by: Vishakh Pillai <vishpillai97@gmail.com>
Co-authored-by: Hannah Ofstedahl <98632391+chapmanhk@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Noreen Mayat <nm3224@alum.barnard.edu>
Co-authored-by: Meshach Ogunmodede <142531479+Mesh-ach@users.noreply.github.com>
Co-authored-by: Hannah Ofstedahl <hannahxchapman@yahoo.com>
Co-authored-by: Mesh <meshach.ogunmodede@datakind.org>
Co-authored-by: William Carr <bill.carr@datakind.org>
Co-authored-by: kaylawilding <95330483+kaylawilding@users.noreply.github.com>
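The fix(databricks) entry in the commit message above describes a tie-break rule for ambiguous pipeline names: prefer the `[dev dev_cloudrun_sa]` job if one matches, otherwise take the first match in sorted order rather than failing. The sketch below illustrates that rule only. The function name `pick_job_name`, the constant `PREFERRED_MARKER`, and the choice to match the marker as a plain substring are assumptions made for illustration, not the code in `src/webapp/databricks.py`.

```python
from typing import Sequence

# Assumption: the preferred deployment is recognized by this marker appearing
# anywhere in the job name; the commit message only names the marker itself.
PREFERRED_MARKER = "[dev dev_cloudrun_sa]"


def pick_job_name(job_names: Sequence[str], pipeline_substring: str) -> str:
    """Pick one job name for a pipeline, tolerating ambiguous matches."""
    # Collect every job whose name contains the pipeline substring, sorted
    # so that the fallback choice is deterministic.
    matches = sorted(name for name in job_names if pipeline_substring in name)
    if not matches:
        raise LookupError(f"no job matches {pipeline_substring!r}")
    # Prefer the Cloud Run service-account deployment when it is present.
    for name in matches:
        if PREFERRED_MARKER in name:
            return name
    # Otherwise use the first sorted match instead of failing the request.
    return matches[0]
```

With job names such as `[dev alice] pdp_inference` and `[dev dev_cloudrun_sa] pdp_inference` (both hypothetical), the second would be selected; if no name carries the marker, the alphabetically first match is returned.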

chapmanhk and others added 4 commits April 28, 2026 15:09

style: apply ruff format to bronze sync modules

cdb0a14

Co-authored-by: Cursor <cursoragent@cursor.com>

chapmanhk changed the title ~~Feature/gcs bronze sync databricks api~~ feat: trigger Databricks GCS→bronze sync after file validation (Edvise/Legacy) May 27, 2026

chapmanhk and others added 3 commits May 27, 2026 20:36

fix(data): resolve prefixed bronze sync Databricks jobs

639a815

Co-authored-by: Cursor <cursoragent@cursor.com>

fix(data): map bronze sync job ids by environment

e6f89e0

Co-authored-by: Cursor <cursoragent@cursor.com>

fix(data): trigger bronze sync during validation request

585b725

Co-authored-by: Cursor <cursoragent@cursor.com>

chapmanhk marked this pull request as ready for review May 28, 2026 05:27

chapmanhk requested a review from vishpillai123 May 28, 2026 05:28

vishpillai123 approved these changes May 28, 2026

chapmanhk merged commit 7066397 into develop May 29, 2026
6 checks passed

chapmanhk deleted the feature/gcs-bronze-sync-databricks-api branch May 29, 2026 15:37

chapmanhk merged 7 commits into develop from feature/gcs-bronze-sync-databricks-api

chapmanhk commented May 18, 2026 • edited

vishpillai123 left a comment

vishpillai123 commented May 28, 2026 • edited

chapmanhk commented May 29, 2026

2 participants
